# Long Video Understanding

**Qwen2.5-VL-7B-Instruct-GGUF** · unsloth · Apache-2.0 · Image-to-Text · English
Qwen2.5-VL is the latest vision-language model in the Qwen family, offering strong visual understanding and multimodal processing, with support for image and video analysis and structured output; a minimal inference sketch follows.
8,427 downloads · 4 likes

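A minimal sketch of single-image inference with the Qwen2.5-VL family via transformers, shown with the base Qwen/Qwen2.5-VL-7B-Instruct checkpoint (the GGUF quantization above is normally served through llama.cpp-style runtimes instead); the image path is a placeholder.

```python
# Minimal sketch: single-image chat with Qwen2.5-VL via transformers.
# Assumes a recent transformers release with Qwen2.5-VL support;
# "frame.jpg" is a placeholder input.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image = Image.open("frame.jpg")  # placeholder image
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```
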
**docscopeOCR-7B-050425-exp** · prithivMLmods · Apache-2.0 · Image-to-Text · Transformers · Multilingual
docscopeOCR-7B-050425-exp is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct, focusing on document-level OCR, long-context vision-language understanding, and accurate image-to-text conversion of mathematical LaTeX.
531 downloads · 2 likes

**Vamba-Qwen2-VL-7B** · TIGER-Lab · MIT · Video-to-Text · Transformers
Vamba is a hybrid Mamba-Transformer architecture that achieves efficient long-video understanding by combining cross-attention layers with Mamba-2 modules (sketched below).
806 downloads · 16 likes

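The description amounts to an architecture claim, so here is a toy sketch of the hybrid idea, not Vamba's implementation: a GRU stands in for the Mamba-2 module purely for illustration, video tokens are mixed in linear time, and text tokens reach them through cross-attention instead of full self-attention over the concatenated sequence. All dimensions and names are made up.

```python
# Toy hybrid block (illustration only, not Vamba's code): video tokens pass
# through a cheap recurrent mixer standing in for Mamba-2, while text tokens
# use self-attention among themselves plus cross-attention into the video
# stream, so the long video sequence never enters quadratic self-attention.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.video_mixer = nn.GRU(dim, dim, batch_first=True)  # stand-in for Mamba-2
        self.text_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        video, _ = self.video_mixer(video)            # linear-time scan over video tokens
        video = self.norm_v(video)
        t, _ = self.text_self_attn(text, text, text)  # text attends within itself
        t = self.norm_t(text + t)
        c, _ = self.cross_attn(t, video, video)       # text queries the video stream
        return t + c, video

block = HybridBlock()
text = torch.randn(1, 32, 256)     # 32 text tokens
video = torch.randn(1, 4096, 256)  # thousands of video tokens stay out of self-attention
out_text, out_video = block(text, video)
print(out_text.shape, out_video.shape)
```
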
**VideoChat-Flash-Qwen2.5-7B-1M (res224)** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
VideoChat-Flash is a multimodal model built on UMT-L and Qwen2.5-7B-1M, supporting long-video understanding with a context window extended to 1M tokens.
64 downloads · 1 like

**Qwen2.5-VL-3B-Instruct-4bit** · jarvisvasu · Image-to-Text · Transformers · English
Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring enhanced visual understanding, agent capabilities, and long-video processing.
174 downloads · 3 likes

**InternVL2.5 HiCo R64** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
A video multimodal large language model enhanced by Long and Rich Context (LRC) modeling, which improves on existing MLLMs by sharpening the perception of fine-grained details and capturing long-term temporal structure.
252 downloads · 2 likes

**InternVideo2.5-Chat-8B** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
InternVideo2.5 is a video multimodal large language model enhanced by Long and Rich Context (LRC) modeling, built on InternVL2.5. It significantly improves on existing MLLMs by strengthening the perception of fine-grained details and the capture of long-term temporal structure.
8,265 downloads · 60 likes

**LLaVA-Video-7B-Qwen2-TPO** · ruili0 · MIT · Video-to-Text · Transformers
LLaVA-Video-7B-Qwen2-TPO is a video understanding model based on LLaVA-Video-7B-Qwen2 with temporal preference optimization, delivering strong results across multiple benchmarks.
490 downloads · 1 like

**LongVA-7B-TPO** · ruili0 · MIT · Video-to-Text · Transformers
LongVA-7B-TPO is a video-text model derived from LongVA-7B through temporal preference optimization, excelling at long-video understanding tasks.
225 downloads · 1 like

**VideoChat-Flash-Qwen2-7B (res224)** · OpenGVLab · Apache-2.0 · Video-to-Text · Transformers · English
A multimodal model built on UMT-L and Qwen2-7B that supports long-video understanding with only 16 tokens per frame and a context window extended to 128k; the arithmetic below shows what that budget buys.
80 downloads · 6 likes

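Those two figures imply a concrete frame budget. A back-of-the-envelope check, where the 1 fps sampling rate and the 1k-token text reserve are assumptions, not stated specs:

```python
# Frame budget implied by the figures above: 128k-token context at
# 16 tokens per frame, reserving ~1k tokens for the text side.
context, per_frame, text_budget = 128_000, 16, 1_000
frames = (context - text_budget) // per_frame
print(frames)           # 7937 frames fit in context
print(frames / 3600)    # ~2.2 hours of video at an assumed 1 fps
```
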
**Apollo-LMMs-Apollo-7B-t32** · GoodiesHere · Apache-2.0 · Video-to-Text · Transformers · English
Apollo is a family of large multimodal models focused on video understanding, able to process videos up to an hour long and supporting complex video QA and multi-turn dialogue.
67 downloads · 55 likes

**Apollo-LMMs-Apollo-1.5B-t32** · GoodiesHere · Apache-2.0 · Video-to-Text
Apollo is a family of large multimodal models focused on video understanding, excelling at long-video comprehension, temporal reasoning, and complex video question answering.
37 downloads · 10 likes

**LongVU_Llama3_2_1B** · Vision-CAIR · Apache-2.0 · Video-to-Text · PyTorch
LongVU applies spatio-temporal adaptive compression for long video-language understanding, processing long videos efficiently without sacrificing language comprehension; the sketch below illustrates the compression intuition.
465 downloads · 11 likes

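LongVU's actual pipeline is more involved (it leans on learned visual features and cross-modal queries), but the core intuition of temporal compression can be sketched as pruning near-duplicate frames by feature similarity. Everything below, from the threshold to the simulated shot features, is a placeholder:

```python
# Conceptual sketch of temporal compression: drop frames whose features are
# nearly identical to the last kept frame. LongVU's real method is more
# sophisticated; threshold and features here are illustrative placeholders.
import torch
import torch.nn.functional as F

def prune_redundant_frames(feats: torch.Tensor, threshold: float = 0.95) -> list[int]:
    """feats: (num_frames, dim) per-frame embeddings. Returns kept frame indices."""
    kept = [0]
    for i in range(1, feats.shape[0]):
        sim = F.cosine_similarity(feats[i], feats[kept[-1]], dim=0)
        if sim < threshold:  # keep only frames that add new content
            kept.append(i)
    return kept

# Simulate 60 "shots" of 10 near-identical frames each (600 frames total).
shots = torch.randn(60, 384).repeat_interleave(10, dim=0)
feats = shots + 0.01 * torch.randn(600, 384)
print(len(prune_redundant_frames(feats)), "of 600 frames kept")  # roughly 60
```
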
**Oryx-1.5-7B** · THUdyh · Apache-2.0 · Video-to-Text · Multilingual
Oryx-1.5-7B is a 7B-parameter model built on the Qwen2.5 language model, supporting a 32K-token context window and specializing in efficient processing of visual inputs at arbitrary spatial resolutions and temporal lengths.
133 downloads · 7 likes

**LongVU_Llama3_2_3B** · Vision-CAIR · Apache-2.0 · Video-to-Text · PyTorch
LongVU applies spatio-temporal adaptive compression for long video-language understanding, designed to process long video content efficiently.
1,079 downloads · 7 likes

**LongVU_Qwen2_7B** · Vision-CAIR · Apache-2.0 · Video-to-Text
LongVU variant based on Qwen2-7B, focused on long video-language understanding via spatio-temporal adaptive compression.
230 downloads · 69 likes

**LLaVA-Video-7B-Qwen2** · lmms-lab · Apache-2.0 · Video-to-Text · Transformers · English
LLaVA-Video is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding and accepting up to 64 frames of video input (see the sampling sketch below).
34.28k downloads · 91 likes

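A common way to produce a fixed 64-frame input is uniform temporal sampling. A minimal OpenCV sketch, assuming a local video file; LLaVA-Video's own preprocessing may differ:

```python
# Minimal sketch: uniformly sample 64 frames from a video with OpenCV.
# "video.mp4" is a placeholder path.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 64) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # (num_frames, H, W, 3), RGB

clip = sample_frames("video.mp4")
print(clip.shape)
```
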
**Kangaroo** · KangarooGroup · Apache-2.0 · Video-to-Text · Transformers · Multilingual
Kangaroo is a powerful multimodal large language model designed specifically for long-video understanding, supporting bilingual (Chinese-English) dialogue and long video inputs.
163 downloads · 12 likes

**timesformer-large-finetuned-k400** · fcakyon · Video Processing · Transformers
TimeSformer is a video classification model built on a spatio-temporal attention mechanism, designed for video understanding tasks; a loading sketch follows.
254 downloads · 0 likes

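A hedged inference sketch using the TimeSformer classes in transformers, mirroring the Hugging Face documentation example with the base 8-frame Kinetics-400 checkpoint; swapping in fcakyon/timesformer-large-finetuned-k400 should load the same way, though the large variant may expect a different frame count.

```python
# Kinetics-400 classification sketch with TimeSformer via transformers,
# following the HF docs pattern. Dummy frames stand in for a real clip.
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

ckpt = "facebook/timesformer-base-finetuned-k400"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TimesformerForVideoClassification.from_pretrained(ckpt)

video = list(np.random.randn(8, 3, 224, 224))  # 8 dummy frames, channels-first
inputs = processor(video, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
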